In this lab, we will work with vector data, which we discussed last week. First, we will geocode addresses. We will work with a dataset of cancer patients across California, along with Census data on socioeconomic factors. We will also cover how to visualize spatial data.
The objectives of this guide are to teach you how to geocode addresses into latitude and longitude, how to work with vector spatial data using the sf package, and how to map spatial data with tmap.
Let’s get cracking!
Hopefully we remember some of this from Lab 2 last week, but let’s open an R Markdown file: click File in the top menu of RStudio, select New File, and then R Markdown…. A window should pop up. In that window, for the title, put in “Lab 3”. For author, put your name. Leave the HTML radio button selected, and click OK. A new R Markdown file should pop up in the top left window.
Let’s load some packages that we will need this week. We load any packages we previously installed using the function library(). Remember: install once, load every time. If you get an error saying there is no package called..., you need to install those packages using install.packages(). When using a package, library() calls should always be at the top of your R Markdown.
…but here’s a shortcut way to check and install packages you don’t have, then load them up!
packages <- c("tidygeocoder", "sf", "terra", "MapGAM", "tidyverse", "tidycensus", "flextable", "tmap")
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
install.packages(packages[!installed])
}
# load each package; invisible() suppresses the printed list of attached packages
invisible(lapply(packages, library, character.only = TRUE))
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
## terra 1.8.29
## Loading required package: sp
## Loading required package: gam
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
## Loading required package: survival
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ tidyr::extract() masks terra::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::when() masks foreach::when()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'flextable'
##
##
## The following object is masked from 'package:purrr':
##
## compose
##
##
## The following objects are masked from 'package:terra':
##
## align, colorize, rotate, width
And here’s the traditional way we would have loaded them (you don’t need this now, though…)
library(tidygeocoder)
library(sf)
library(terra)
library(MapGAM)
library(tidyverse)
library(tidycensus)
library(flextable)
library(tmap)
First, we are going to tackle how to take addresses and convert them into spatial data (latitude and longitude). So, let’s say we wanted to map all of the marijuana dispensaries across San Francisco. Let’s download a .csv of these addresses from the GitHub site, then take a look at the dataset.
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/san_francisco_active_marijuana_retailers.csv", "san_francisco_active_marijuana_retailers.csv", mode = "wb")
sf_mj <- read_csv("san_francisco_active_marijuana_retailers.csv")
## Rows: 33 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): License Number, License Type, Business Owner, Business Structure, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sf_mj)
## # A tibble: 6 × 10
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 6 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>
OK, some interesting columns there, and we have Premise Address as a column that we might want to make spatial. Let’s look closer at that.
head(sf_mj$`Premise Address`)
## [1] "2165 IRVING ST san francisco, CA 94122 County: SAN FRANCISCO"
## [2] "122 10TH ST SAN FRANCISCO, CA 941032605 County: SAN FRANCISCO"
## [3] "843 Howard ST SAN FRANCISCO, CA 94103 County: SAN FRANCISCO"
## [4] "70 SECOND ST SAN FRANCISCO, CA 94105 County: SAN FRANCISCO"
## [5] "527 Howard ST San Francisco, CA 94105 County: SAN FRANCISCO"
## [6] "2414 Lombard ST San Francisco, CA 94123 County: SAN FRANCISCO"
OK, that column looks like what we want to geocode. But how do we turn these addresses into spatial information? We have to geocode them! To do so, we will use the tidygeocoder package in R. But first, notice that the addresses look a little strange: each one ends with the string “County: SAN FRANCISCO”, so we will gsub() that entire string out.
sf_mj$`Premise Address` <- gsub(" County: SAN FRANCISCO",
"", sf_mj$`Premise Address`)
head(sf_mj$`Premise Address`)
## [1] "2165 IRVING ST san francisco, CA 94122"
## [2] "122 10TH ST SAN FRANCISCO, CA 941032605"
## [3] "843 Howard ST SAN FRANCISCO, CA 94103"
## [4] "70 SECOND ST SAN FRANCISCO, CA 94105"
## [5] "527 Howard ST San Francisco, CA 94105"
## [6] "2414 Lombard ST San Francisco, CA 94123"
That looks much better.
Now let’s try geocoding these addresses with the tidygeocoder package. We will use the geocode() function to add a latitude and longitude to each of our addresses in the Premise Address column. We will use the OpenStreetMap (Nominatim) service by specifying method = "osm". This will take about a minute to run, so be patient!
sf_mj_geo <- geocode(sf_mj, "Premise Address",
method = "osm")
## Passing 33 addresses to the Nominatim single address geocoder
## Query completed in: 37.9 seconds
head(sf_mj_geo)
## # A tibble: 6 × 12
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 8 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>, lat <dbl>, long <dbl>
Hmm, looks like some of our addresses have an NA for
their lat and long. Let’s take a closer look.
summary(sf_mj_geo$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 37.71 37.75 37.78 37.77 37.78 37.80 11
summary(sf_mj_geo$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -122.5 -122.4 -122.4 -122.4 -122.4 -122.4 11
Looks like we have 11 addresses missing lat and 11 missing long. Let’s try this again using a different geocoding service, ArcGIS (method = "arcgis").
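Before we do, it can help to see exactly which addresses the OSM geocoder failed on. A quick sketch using the sf_mj_geo object from above:

```r
# pull out the rows where the OSM geocoder returned no coordinates
failed <- sf_mj_geo %>%
  filter(is.na(lat)) %>%
  select(`Premise Address`)

failed
```

Often the failures are addresses with unusual formatting (like that run-together ZIP code, 941032605), which a different geocoder may handle better.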
sf_mj_geo_arc <- geocode(sf_mj, "Premise Address",
method = "arcgis")
## Passing 33 addresses to the ArcGIS single address geocoder
## Query completed in: 17.5 seconds
head(sf_mj_geo_arc)
## # A tibble: 6 × 12
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 8 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>, lat <dbl>, long <dbl>
summary(sf_mj_geo_arc$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.71 37.76 37.77 37.77 37.78 37.80
summary(sf_mj_geo_arc$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -122.5 -122.4 -122.4 -122.4 -122.4 -122.4
Woohoo! No missingness. Love to see it. OK, let’s plot these data and see how they look.
plot(sf_mj_geo_arc$long, sf_mj_geo_arc$lat)

We are in business! We have taken addresses and converted them into
latitude and longitude! I think we need a badge! 
Bonus exercise! Let’s take these addresses and reverse geocode them. That’s just a fancy way of saying that we will take latitude and longitude data and convert them into readable addresses. We use the aptly named reverse_geocode() function and specify which columns to look at (lat and long), the method we want for geocoding, and what we want the new address column to be named. Then we drop some columns that we aren’t really interested in. Remember, we are doing this the tidy way, so we are using %>% pipes.
reverse <- sf_mj_geo_arc %>%
reverse_geocode(lat = lat, long = long, method = 'arcgis',
address = address_found) %>%
select(-`Business Owner`,-`Business Structure`,-`License Number`,-`License Type`,-Status,-`Issue Date`,-`Expiration Date`,-Activities,-`Adult-Use/Medicinal`)
## Passing 33 coordinates to the ArcGIS single coordinate geocoder
## Query completed in: 19.1 seconds
head(reverse)
## # A tibble: 6 × 4
## `Premise Address` lat long address_found
## <chr> <dbl> <dbl> <chr>
## 1 2165 IRVING ST san francisco, CA 94122 37.8 -122. Smokin D's BBQ, 2181 Irvi…
## 2 122 10TH ST SAN FRANCISCO, CA 941032605 37.8 -122. Urbana Weed Dispensary SO…
## 3 843 Howard ST SAN FRANCISCO, CA 94103 37.8 -122. Urbana Geary, 843 Howard …
## 4 70 SECOND ST SAN FRANCISCO, CA 94105 37.8 -122. Cannavine Cannabis Dispen…
## 5 527 Howard ST San Francisco, CA 94105 37.8 -122. Fixed, 527 Howard St, Ste…
## 6 2414 Lombard ST San Francisco, CA 94123 37.8 -122. Mile 7.6 US Hwy 101 N, Sa…
Looking at Premise Address and address_found we can see that we did pretty well! Not perfect, but most are the right address or a few doors down. Well done!
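If you want a rough automated check of that agreement, one option is to compare the ZIP codes embedded in the two address strings. This is just a sketch, assuming both strings contain a 5-digit ZIP (stringr is loaded as part of the tidyverse):

```r
# extract the first 5-digit run from each address string and compare
zip_check <- reverse %>%
  mutate(zip_orig  = str_extract(`Premise Address`, "\\d{5}"),
         zip_found = str_extract(address_found, "\\d{5}"),
         zip_match = zip_orig == zip_found)

table(zip_check$zip_match, useNA = "ifany")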
Although there are a few ways to work with vector spatial data in R, we will focus on the sf package in this course. The majority of spatial folks in R have shifted to sf for vector data, and so it makes sense to focus on it in the class.
Processing spatial data is very similar to nonspatial data thanks to the package sf, which is tidy friendly. sf stands for simple features. The Simple Features standard defines a simple feature as a representation of a real world object by a point or points that may or may not be connected by straight line segments to form lines or polygons. A feature is thought of as a thing, or an object in the real world, such as a building or a tree. A county can be a feature. As can a city and a neighborhood. Features have a geometry describing where on Earth the features are located, and they have attributes, which describe other properties.
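To make this concrete, here is a minimal sketch of building an sf object by hand: two point features (hypothetical trees), each with an attribute and a geometry. The species names and coordinates are made up for illustration.

```r
library(sf)

# two point geometries (longitude, latitude), with a geographic CRS (WGS84)
geoms <- st_sfc(st_point(c(-122.45, 37.77)),
                st_point(c(-122.41, 37.80)),
                crs = 4326)

# attach attributes to the geometries to form features
trees <- st_sf(species = c("oak", "redwood"), geometry = geoms)

trees
```

Each row is a feature: the species column is an attribute, and the geometry column holds the location. The spatial datasets we work with below have exactly this structure, just with many more features.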
Now let’s get our hands dirty working with some spatial data.
For this lab, we will primarily be working with the MapGAM package. If you go to the link, you can read the reference manual on the various datasets available in the package. For this lab, we will mainly be working with the CAdata dataset. While they are based on real patterns expected in observational epidemiologic studies, these data have been simulated and are for teaching purposes only. The data contain 5000 simulated ovarian cancer cases. This is a cohort with time to mortality measured, but for the purposes of our class, we will conduct simple tabular analyses looking at associations between spatial exposures with mortality at end of follow-up.
The CAdata dataset contains the following variables: time (follow-up time in years), event (mortality indicator: 1 = died, 0 = censored), X and Y (projected coordinates), AGE (age at diagnosis), and INS (insurance type).
So let’s bring in the CAdata dataset and have a look at it. We don’t have to do much because it’s part of the MapGAM package and loads quickly with the data() command.
data(CAdata)
ca_pts <- CAdata
summary(ca_pts)
## time event X Y
## Min. : 0.004068 Min. :0.0000 Min. :1811375 Min. :-241999
## 1st Qu.: 1.931247 1st Qu.:0.0000 1st Qu.:2018363 1st Qu.: -94700
## Median : 4.749980 Median :1.0000 Median :2325084 Median : -60386
## Mean : 6.496130 Mean :0.6062 Mean :2230219 Mean : 87591
## 3rd Qu.: 9.609031 3rd Qu.:1.0000 3rd Qu.:2380230 3rd Qu.: 318280
## Max. :24.997764 Max. :1.0000 Max. :2705633 Max. : 770658
## AGE INS
## Min. :25.00 Mcd: 431
## 1st Qu.:53.00 Mcr:1419
## Median :62.00 Mng:2304
## Mean :61.28 Oth: 526
## 3rd Qu.:71.00 Uni: 168
## Max. :80.00 Unk: 152
OK, so the variables look great. Is it a spatial dataset that R recognizes as such? Not just yet. We know that X is likely some sort of longitude column and Y is likely some sort of latitude column, although the values don’t exactly look like degrees. We have to tell R that the X and Y columns are spatial coordinates using the st_as_sf() function in sf. With this command, we specify which columns hold the coordinates with the coords = c() argument.
ca_pts <- st_as_sf(CAdata, coords=c("X","Y"))
Let’s check the coordinate reference system (CRS) using the
st_crs command in the sf package.
st_crs(ca_pts)
## Coordinate Reference System: NA
Hmmm, NA. That still doesn’t look good. So how do we make this a spatial file? We will need to add a CRS.
Let’s add a CRS using st_set_crs() from the sf package. We get the CRS for this dataset from the MapGAM documentation (don’t worry, it took me forever to find this, but usually the CRS is much easier to find). Then we will double-check the CRS.
#Load the projection into an object called ca_proj
ca_proj <- "+proj=lcc +lat_1=40 +lat_2=41.66666666666666
+lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
+y_0=500000.0000000002 +ellps=GRS80
+datum=NAD83 +units=m +no_defs"
#Set CRS
ca_pts_crs <- st_set_crs(ca_pts, ca_proj)
#Look at dataset
summary(ca_pts_crs)
## time event AGE INS
## Min. : 0.004068 Min. :0.0000 Min. :25.00 Mcd: 431
## 1st Qu.: 1.931247 1st Qu.:0.0000 1st Qu.:53.00 Mcr:1419
## Median : 4.749980 Median :1.0000 Median :62.00 Mng:2304
## Mean : 6.496130 Mean :0.6062 Mean :61.28 Oth: 526
## 3rd Qu.: 9.609031 3rd Qu.:1.0000 3rd Qu.:71.00 Uni: 168
## Max. :24.997764 Max. :1.0000 Max. :80.00 Unk: 152
## geometry
## POINT :5000
## epsg:NA : 0
## +proj=lcc ...: 0
##
##
##
#Check the CRS
st_crs(ca_pts_crs)
## Coordinate Reference System:
## User input: +proj=lcc +lat_1=40 +lat_2=41.66666666666666
## +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
## +y_0=500000.0000000002 +ellps=GRS80
## +datum=NAD83 +units=m +no_defs
## wkt:
## PROJCRS["unknown",
## BASEGEOGCRS["unknown",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]]],
## CONVERSION["unknown",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",39.3333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-122,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",40,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",41.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",2000000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
Nice! We have a spatial dataset. That geometry column is how sf stores the geographic data, and the coordinates there look a bit more like what we would expect. And we definitely have a full CRS with all sorts of info. OK, let’s plot our data to make sure they look spatial! We will use the tmap package to plot the points. We will first set the tmap_mode to “view”, which is interactive. There’s also a “plot” option, which is nice for making exportable figures. We will then create an object called cancer_map and add a layer with tm_shape(). This allows us to combine several maps into one, or to add layers on top of each other. Then we have to specify how to display that layer; here we will use tm_dots() to plot the points. In our options, we specify the size with size=.
tmap_mode("view")
## ℹ tmap modes "plot" - "view"
## ℹ toggle with `tmap::ttm()`
cancer_map = tm_shape(ca_pts_crs) + tm_dots(size=0.5)
cancer_map
## Registered S3 method overwritten by 'jsonify':
## method from
## print.json jsonlite
Let’s play around with some of the options. We can change the color with the fill= option, make the dots smaller by shrinking the size= option, and change the transparency of the points with the fill_alpha= option.
cancer_map_small = tm_shape(ca_pts_crs) +
tm_dots(fill = "blue", size = 0.3, fill_alpha = 0.5)
cancer_map_small
Let’s map the points color-coded by the variable event, i.e., whether or not the participant died over follow-up. We do that by specifying that the color (fill=) is based on the column event. We have to specify that event is a categorical variable with fill.scale = tm_scale_categorical().
cancer_map_events = tm_shape(ca_pts_crs) +
tm_dots(size=0.3, fill="event", fill.scale=tm_scale_categorical())
cancer_map_events
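Finally, if you want to save one of these maps as a static image (say, for a report), switch to “plot” mode and use tmap_save(). A sketch (the filename here is just an example):

```r
# switch from the interactive viewer to static plotting
tmap_mode("plot")

# write the map out as an image file (filename is arbitrary)
tmap_save(cancer_map_events, "cancer_map_events.png")
```

You can switch back to interactive mode at any time with tmap_mode("view").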